32 research outputs found

    Survey of Species Covered by DNA Barcoding Data in BOLD and GenBank for Integration of Data for Museomics

    DNA barcoding has become widely used by biodiversity and molecular biology researchers to identify species and analyze their phylogeny. Recently, DNA metabarcoding and environmental DNA (eDNA) techniques have been developed by extending the concept of DNA barcoding. These techniques analyze the diversity and quantity of organisms within an environment by detecting biogenic DNA in water and soil. They are particularly popular for monitoring fish species living in rivers and lakes (Takahara et al. 2012). BOLD Systems (Barcode of Life Data Systems, Ratnasingham and Hebert 2007) is a database for DNA barcoding that archives 8.5 million barcodes (as of August 2020) together with metadata on the voucher specimen from which each barcode sequence is derived, including taxonomy, country of collection, and the museum holding the voucher (e.g. https://www.boldsystems.org/index.php/Public_RecordView?processid=TRIBS054-16). Many barcoding data are also submitted to GenBank (Sayers et al. 2020), a DNA sequence database managed by NCBI (National Center for Biotechnology Information, US). The number of DNA barcode records, i.e. COI (cytochrome c oxidase I) gene sequences for animals, has grown significantly (Porter and Hajibabaei 2018). BOLD imports DNA barcoding data from GenBank, and many DNA barcoding records in GenBank are also assigned BOLD IDs. Nevertheless, we still have to consult both BOLD and GenBank when performing DNA barcoding. I have previously investigated the registration of DNA barcoding data in GenBank, especially its association with BOLD, using insects and flowering plants as examples (Nakazato 2019). Here, I surveyed the number of species covered by BOLD and GenBank. I used fish data as an example because eDNA research is particularly focused on fish. I downloaded all GenBank files for vertebrates from the NCBI FTP (File Transfer Protocol) site (as of November 2019). Of the GenBank fish entries, 86,958 (7.3%) were assigned BOLD identifiers (IDs). 
The NCBI Taxonomy database contains 39,127 fish species and 20,987 species-level scientific names (i.e., names that do not include sp., cf. or aff.). GenBank entries with BOLD IDs covered 11,784 species (30.1%) and 8,665 species-level names (41.3%). I also obtained the complete "specimens and sequences combined data" for fish from BOLD Systems (as of November 2019). BOLD holds 273,426 entries registered as fish. Of these, 211,589 were assigned GenBank IDs, i.e. had values in the "genbank_accession" column, and 121,748 were imported from GenBank, i.e. carried the description "Mined from GenBank, NCBI" in the "institution_storing" column. The BOLD data covered 18,952 fish species and 15,063 species-level names, but 35,500 entries had no species-level name and 22,123 entries lacked even a family-level name. At the species level, 8,067 names occurred in both GenBank and BOLD, with 6,997 BOLD-specific names and 599 GenBank-specific names. GenBank has 425,732 fish entries with voucher IDs, of which 340,386 were not assigned a BOLD ID. Of these 340,386 entries, 43,872 are COI gene records, which could be candidates for DNA barcodes. These candidates include 4,201 species that are not included in BOLD, so adding these data would enable us to identify 19,863 fish to the species level. For researchers, it would be very useful if both BOLD and GenBank DNA barcoding data could be searched in one place. For this purpose, it is necessary to integrate data from the two databases. Much biodiversity data is recorded using the Darwin Core standard, while DNA sequencing data are sometimes integrated or cross-linked using RDF (Resource Description Framework). 
Integrating these data may not be technically difficult, but the two databases reference different species data, EoL (The Encyclopedia of Life) for BOLD and the NCBI Taxonomy for GenBank, and the differences between these taxonomic systems make it difficult to match records by scientific name. GenBank has fields for the latitude and longitude of the specimens sampled, and Porter and Hajibabaei (2018) argue that this information should be enhanced. However, this information may be better described in specimen and occurrence databases. Integrating barcoding data with specimen and occurrence data will solve these problems. Most importantly, it will save researchers from having to register the same information in multiple databases. In the field of biodiversity, DNA barcode sequences may have been the only gene sequences to receive attention and use. The museomics community regards museum-preserved specimens as rich resources for DNA studies because their biodiversity information can accompany the extraction and analysis of their DNA (Nakazato 2018). GenBank is useful for biodiversity studies because of its low rate of mislabelling (Leray et al. 2019). In the future, we will be working with a variety of DNA data, including genomes from museum specimens as well as DNA barcodes. This will require more integrated use of biodiversity information and DNA sequence data. This integration is also of interest to molecular biologists and bioinformaticians.
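
The name comparison described in this abstract (shared, BOLD-specific, and GenBank-specific species-level names) can be sketched roughly as follows. This is a minimal illustration, not the author's pipeline: the function names, the rule that a species-level name has at least two words, and the regular expression for sp., cf. and aff. are all assumptions made here.

```python
import re

# Open-nomenclature markers that the abstract excludes from species-level names.
_OPEN_NOMENCLATURE = re.compile(r"\b(sp|cf|aff)\.")


def is_species_level(name: str) -> bool:
    """Assumed rule: at least two words and no sp., cf. or aff. marker."""
    return len(name.split()) >= 2 and not _OPEN_NOMENCLATURE.search(name)


def compare_names(genbank_names, bold_names):
    """Split species-level names into shared and database-specific sets."""
    gb = {n for n in genbank_names if is_species_level(n)}
    bold = {n for n in bold_names if is_species_level(n)}
    return {
        "shared": gb & bold,        # names found in both databases
        "bold_only": bold - gb,     # BOLD-specific names
        "genbank_only": gb - bold,  # GenBank-specific names
    }
```

Exact string matching is the simplest possible choice; as the abstract notes, the two databases follow different taxonomic systems, so a real comparison would likely need synonym resolution on top of this.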

    Current situation of DNA Barcoding data in biodiversity and genomics databases and data integration for museomics

    Museomics regards museum-preserved specimens as rich resources for DNA studies, extracting and analyzing DNA from these specimens in conjunction with their biodiversity information. In the biodiversity field as well, DNA sequence data such as DNA barcodes have become essential evidence for species identification and phylogenetic analysis, alongside occurrence and morphological information. To accelerate biodiversity informatics, it is important to use both biodiversity occurrence and morphology data and bioinformatics sequence data. There are many databases in the biodiversity domain, such as GBIF (The Global Biodiversity Information Facility) for species occurrence records, EoL (The Encyclopedia of Life) as a knowledge base of all species, and BOLD (The Barcode of Life Data System) for DNA barcoding data. In genomics, molecular data involving DNA and protein sequences have been collected by the DNA Data Bank of Japan (DDBJ), the European Bioinformatics Institute (EBI, UK), and the National Center for Biotechnology Information (NCBI, US) under the International Nucleotide Sequence Database Collaboration (INSDC) for more than 30 years. Recently, NCBI launched a new database called BioCollections, which includes 7,930 culture collections, museums, herbaria, and other natural history collections. In addition, we can submit biodiversity information such as specimen voucher IDs, BOLD IDs, and latitude/longitude together with DNA sequences. To assess the current situation, I downloaded GenBank (Nucleotide) files (updated on 22 Feb 2019) from the NCBI FTP (file transfer protocol) site and extracted biodiversity features including specimen voucher IDs and BOLD IDs. For Insecta, of 3,389,495 total entries, 2,427,343 sequence entries have specimen voucher IDs and 1,766,142 have BOLD IDs. The most abundant species with voucher IDs is "Cecidomyiidae sp. BOLD-2016" (Diptera) (35,861 sequence entries). 
The most frequently referenced voucher ID is "USNMENT00921257" (1,510 sequence entries), which denotes Stenamma megamanni (Hymenoptera, Formicidae, Myrmicinae). For flowering plants (Magnoliophyta), of 3,094,140 total entries, 1,109,420 sequence entries are assigned voucher IDs and 73,409 are assigned BOLD IDs. Additionally, 79,891 matK entries and 63,821 rbcL entries were submitted with voucher IDs but without BOLD IDs. I also retrieved BOLD data for Insecta and flowering plants. For Insecta, 2,368,801 GenBank entries are referenced from a total of 4,176,481 BOLD entries, and for flowering plants, 259,245 GenBank entries are referenced from 345,706 BOLD entries. Some DNA barcoding data exist redundantly in the BOLD database because BOLD imports from NCBI sequences that were originally submitted as DNA barcoding data to BOLD. These entries have different BOLD IDs but are assigned the same BIN_URL. Recently, high-throughput sequencing technology, also called next-generation sequencing (NGS), has had a great impact on genomic science. Biodiversity researchers have begun to perform not only DNA barcoding but also RNA-Seq with NGS. NGS also accelerates museomics. NGS data are archived in the Sequence Read Archive (SRA) database, and sample information is described in the BioSample database in INSDC. To use NGS data in the biodiversity field, we will need to integrate these databases with other biodiversity databases. We at the Database Center for Life Science are working to integrate life science data with Semantic Web technology. We have held annual meetings called BioHackathons to integrate life science data, in which researchers from all over the world have participated. We have begun to RDFize BioSample data, but we should also import existing schemas used in the biodiversity field, including Darwin Core.
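
The extraction step described in this abstract (counting GenBank entries that carry specimen voucher IDs and BOLD IDs) might look roughly like the sketch below. It assumes the downloaded flat file has been read into a string, that records end with a `//` line, and that the IDs appear as `/specimen_voucher="..."` and `/db_xref="BOLD:..."` qualifiers; the function names are hypothetical, and qualifier values wrapped across several lines are not handled.

```python
import re

# Qualifier patterns as they appear in GenBank flat-file feature tables.
VOUCHER_RE = re.compile(r'/specimen_voucher="([^"]+)"')
BOLD_RE = re.compile(r'/db_xref="BOLD:([^"]+)"')


def iter_records(flatfile_text):
    """Yield one chunk of text per record, splitting on the '//' terminator."""
    for chunk in flatfile_text.split("\n//"):
        if chunk.strip():
            yield chunk


def summarize(flatfile_text):
    """Count records overall, with a voucher ID, and with a BOLD ID."""
    total = with_voucher = with_bold = 0
    for record in iter_records(flatfile_text):
        total += 1
        with_voucher += bool(VOUCHER_RE.search(record))
        with_bold += bool(BOLD_RE.search(record))
    return {"total": total, "with_voucher": with_voucher, "with_bold": with_bold}
```

For files of the size mentioned in the abstract, a streaming parser (for example a dedicated GenBank reader) would be a more realistic choice than holding the whole file in memory.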

    Knowledge Extraction from Specimen-Derived Data from GenBank to Enrich Biodiversity Information

    DNA barcoding and environmental DNA (eDNA) are increasing the need to use gene sequences in the field of biodiversity. GBIF (Global Biodiversity Information Facility) and GGBN (Global Genome Biodiversity Network) are taking action on the treatment of gene sequences in the field of biodiversity (Finstad et al. 2020). Gene sequences have been collected and published by INSDC (International Nucleotide Sequence Database Collaboration) for over 30 years (Arita et al. 2020). Biodiversity information has been collected using standards such as Darwin Core (Wieczorek et al. 2012), but INSDC gene sequences are stored in their own format. In the field of bioinformatics, researchers are also organizing the BioHackathon series, notably the NBDC/DBCLS BioHackathon and the spin-off BioHackathon Europe, to standardize data through the Semantic Web (Garcia Castro et al. 2021, Vos et al. 2020), but the linkage with biodiversity information has just begun. In this study, as an example of linking gene sequence information with biodiversity information, I attempted to construct an infrastructure for knowledge extraction by using gene sequence entries in GenBank that are derived from museum specimens (Sayers et al. 2020). I have previously surveyed the BOLD (The Barcode of Life Data System) (Ratnasingham and Hebert 2007) IDs listed in GenBank (Nakazato 2020). I downloaded the fish and insect data from the GenBank FTP (file transfer protocol) site. Then I extracted the descriptions in the "specimen_voucher" field and obtained 749,627 specimen IDs for fish (28% of the fish entries in GenBank) and 1,621,890 for insects (13% of the insect entries). I also extracted from the "note" field approximately 1,000 entries describing the type of the specimen, such as "holotype", "lectotype", and "paratype". These extracts include descriptions written in natural language. NCBI (National Center for Biotechnology Information) publishes the BioCollections database (Sharma et al. 2019), and these data may help to refine the descriptions. In the future, I plan to map these extracted IDs to the collection IDs in the biodiversity information database. This will enable us to enrich the biodiversity information with GenBank descriptions, for example, by adding articles listed in GenBank as references to the specimen data.
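
Because the "note" field is free text, spotting type-specimen terms is essentially keyword matching. The sketch below is only an illustration under that assumption: the term list contains just the three examples named in the abstract, and `type_statuses` is a hypothetical helper rather than anything provided by GenBank.

```python
import re

# Only the three type terms named in the abstract; a real list would be longer.
TYPE_TERMS = ("holotype", "lectotype", "paratype")
TYPE_RE = re.compile(r"\b(" + "|".join(TYPE_TERMS) + r")\b", re.IGNORECASE)


def type_statuses(note: str):
    """Return the distinct type terms found in a free-text note, lowercased."""
    return sorted({match.group(1).lower() for match in TYPE_RE.finditer(note)})
```

Whole-word matching means plural forms such as "paratypes" are missed, which is one reason natural-language notes are hard to normalize with patterns alone.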

    Experimental design-based functional mining and characterization of high-throughput sequencing data in the sequence read archive.

    High-throughput sequencing technology, also called next-generation sequencing (NGS), has the potential to revolutionize the whole process of genome sequencing, transcriptomics, and epigenetics. Sequencing data are captured in a public primary data archive, the Sequence Read Archive (SRA). As of January 2013, data from more than 14,000 projects had been submitted to SRA, double the number of the previous year. Researchers can download raw sequence data from the SRA website to perform further analyses and to compare with their own data. However, it is extremely difficult to search entries and download raw sequences of interest with SRA because the data structure is complicated, and experimental conditions accompanying the raw sequences are partly described in natural language. Additionally, some sequences are of inconsistent quality because anyone can submit sequencing data to SRA with no quality check. Therefore, as a criterion of data quality, we focused on SRA entries that were cited in journal articles. We extracted SRA IDs and PubMed IDs (PMIDs) from SRA and full-text versions of journal articles and retrieved 2,748 SRA ID-PMID pairs. We constructed a publication list referring to SRA entries. Since one of the main themes of -omics analyses is clarification of disease mechanisms, we also characterized SRA entries by disease keywords, according to the Medical Subject Headings (MeSH) extracted from articles assigned to each SRA entry. We obtained 989 SRA ID-MeSH disease term pairs and constructed a disease list referring to SRA data. We previously developed feature profiles of diseases in a system called "Gendoo". We generated hyperlinks between the diseases extracted from SRA and their feature profiles. The project, publication and disease lists resulting from this study are available at our web service, called "DBCLS SRA" (http://sra.dbcls.jp/). This service will improve accessibility to high-quality data from SRA.
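
Pairing SRA IDs with PMIDs from article full text can be approximated with a pattern match on accession-like strings. The following is a rough sketch, not the method used for DBCLS SRA: the accession pattern (an S, E or D prefix, then R, a type letter, and six or more digits) and the function name are assumptions made for illustration.

```python
import re

# Assumed accession shape, e.g. SRP/SRX/SRR/SRS/SRA plus ERx and DRx variants.
SRA_RE = re.compile(r"\b[SED]R[APXRS]\d{6,}\b")


def sra_pmid_pairs(articles):
    """articles maps a PMID to the article's full text.

    Returns sorted, de-duplicated (SRA ID, PMID) pairs.
    """
    pairs = set()
    for pmid, text in articles.items():
        for accession in SRA_RE.findall(text):
            pairs.add((accession, pmid))
    return sorted(pairs)
```

A pattern this loose can pick up look-alike strings, so matched accessions would normally be checked against the archive's own ID list before being counted.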

    MeSH ORA framework: R/Bioconductor packages to support MeSH over-representation analysis

    Background: In genome-wide studies, over-representation analysis (ORA) against a set of genes is an essential step for biological interpretation. Many gene annotation resources and software platforms for ORA have been proposed. Recently, Medical Subject Headings (MeSH) terms, which are annotations of PubMed documents, have been used for ORA. MeSH enables the extraction of broader meaning from gene lists and is expected to become an exhaustive annotation resource for ORA. However, the existing MeSH ORA software platforms are still not sufficient for several reasons. Results: In this work, we developed an original MeSH ORA framework composed of six types of R packages, including MeSH.db, MeSH.AOR.db, MeSH.PCR.db, the org.MeSH.XXX.db-type packages, MeSHDbi, and meshr. Conclusions: Using our framework, users can easily conduct MeSH ORA. By utilizing the enriched MeSH terms, related PubMed documents can be retrieved and saved on local machines within this framework.
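
Over-representation analysis is commonly based on a one-sided hypergeometric test, and the core calculation is short enough to sketch. The code below is a generic Python illustration of that test, not the meshr API (the packages in the abstract are R/Bioconductor packages); the function and parameter names are chosen here for readability.

```python
from math import comb


def ora_p_value(universe: int, annotated: int, selected: int, overlap: int) -> float:
    """One-sided hypergeometric p-value P(X >= overlap).

    universe:  number of genes in the background set
    annotated: background genes annotated with the term (e.g. a MeSH term)
    selected:  number of genes in the gene list of interest
    overlap:   selected genes that carry the annotation
    """
    upper = min(annotated, selected)
    favourable = sum(
        comb(annotated, k) * comb(universe - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(universe, selected)
```

In practice this p-value is computed for every candidate term and then corrected for multiple testing; that step is omitted here.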

    List of top 10 diseases extracted from the Sequence Read Archive (SRA).

    We extracted disease terms of the Medical Subject Headings (MeSH) from assigned journal articles referring to the SRA entries. The MeSH disease category contains not only the disease name but also symptoms. The OMIM ID was converted from the MeSH terms to the Disease Name by using the Disease Ontology (DO). "Lung Neoplasms" should be assigned to the OMIM entry "Lung Cancer" (OMIM ID: 211980); however, there is no link in the DO.